In 2019, the American Economic Association updated its Data and Code Availability Policy, which now requires that the AEA Data Editor verify the reproducibility of all papers before they are accepted by an AEA journal. In addition to the requirements laid out in the policy, several specific recommendations were produced to facilitate compliance. This change in policy is expected to improve the computational reproducibility of all published research going forward, after several studies showed that rates of computational reproducibility in economics at large range from somewhat low to alarmingly low (Galiani, Gertler, and Romero 2018; Chang and Li 2015; Kingi et al. 2018).
Replication, or the process by which a study’s hypotheses and findings are re-examined using different data or different methods (or both) (King 1995), is an essential part of the scientific process that allows science to be “self-correcting.” Computational reproducibility, or the ability to reproduce the results, tables, and other figures using the available data, code, and materials, is a precondition for replication. Computational reproducibility is assessed through the process of reproduction. At the center of this process is the reproducer (you!), a party not involved in the production of the original paper. Reproductions sometimes involve the original author (whom we refer to as “the author”) in cases where additional guidance and materials are needed to execute the process.
This exercise is designed for reproductions performed in economics graduate courses or undergraduate theses, with the goal of providing a common approach, terminology, and standards for conducting reproductions. The goal of reproduction, in general, is to assess and improve the computational reproducibility of published research in a way that facilitates further robustness checks, extensions, collaborations, and replication.
This exercise is part of the Accelerating Computational Reproducibility in Economics (ACRE) project led by the Berkeley Initiative for Transparency in the Social Sciences (BITSS) and Prof. Lars Vilhuber, Data Editor for the journals of the American Economic Association (AEA). ACRE looks to assess, enable, and improve the computational reproducibility of published economics research.
Assessments of reproducibility can easily gravitate towards binary assessments that declare an entire paper “reproducible” or “non-reproducible.” These guidelines suggest a more nuanced approach by highlighting two reasons that make binary judgments less relevant.
First, a paper may contain several scientific claims (or major hypotheses) that may vary in computational reproducibility. Each claim is tested using different methodologies, and results are presented in one or more display items (outputs like tables and figures). Each display item may itself contain several specifications. Figure 0.1 illustrates this idea.
Figure 0.1: One paper has multiple components to reproduce.
DI: Display Item, S: Specification
Second, for a given specification there are several levels of reproducibility, ranging from the absence of any materials to complete reproducibility starting from raw data. And even for a specific claim-specification, distinguishing the appropriate level can be far more constructive than simply labeling it as (ir)reproducible.
Note that the highest level of reproducibility, which requires complete reproducibility starting from raw data, is very demanding to achieve and should not be expected of all published research – especially before 2019. Instead, this level can serve as an aspiration as the field of economics at large seeks to improve the reproducibility of research and facilitate the transmission of knowledge throughout the scientific community.
This reproduction exercise is divided into four stages, corresponding to the first four chapters of these guidelines, with a fifth optional stage:
Extension (if applicable), where you may extend the current paper by including new methodologies or data. This step brings the reproduction exercise a step closer to replication.
Figure 2: Steps for reproduction
(1) (2) (3) (4) (5)
scope --> assess --> improve --> robust --> extend
▲ | | ▲
| | | |
|_________| |___________________|
Suggested level of effort:

| Context            | Scope | Assess | Improve | Robust | Extend |
|--------------------|-------|--------|---------|--------|--------|
| Graduate research  | 5%    | 10%    | 5%      | 10%    | 70%    |
| Graduate course    | 10%   | 25%    | 20%     | 40%    | 5%     |
| Undergrad thesis   | 10%   | 30%    | 40%     | 20%    | 0%     |

Figure 2 depicts suggested levels of effort for each stage of the exercise, depending on the context in which you are performing a reproduction. This process need not be chronologically linear. For example, you may realize that the scope of a reproduction is too ambitious and switch to a less intensive one. Later in the exercise, you can also begin testing different specifications for robustness while also assessing a paper’s level of reproducibility.
You will be asked to record your reproduction progress through each stage.
In Stage 1: Scoping, you will complete Survey 1, declaring your paper of choice and the specific display item(s) and specifications on which you will focus for the remainder of the exercise. This step may also involve writing a brief 1-2 page summary of the paper (confirm this with your instructor).
In Stage 2: Assessment, you will inspect the paper’s reproduction package (raw data, analysis data, and code), connect the display item to be reproduced with its inputs, and assign a reproducibility score to each output.
In Stage 3: Improvement, you will try to improve the reproducibility of the selected outputs by adding missing files and documentation, and you will report any resulting changes in the level of reproducibility. Use Survey 2 to record your work at Stages 2 and 3 (you will receive access instructions for Survey 2 when you submit Survey 1).
In Stage 4: Robustness Checks, you will assess different analytical choices and test possible variations. Use Survey 3 to record your work at this stage.
Generally, a reproduction will begin with a thorough reading of the study being reproduced. Subsequent steps, however, may follow from a reproduction strategy. For example, a reproduction may closely follow the order of the steps outlined above: the reproducer first chooses a set of results they are interested in assessing or understanding, reproduces these results to the extent possible, and then modifies the reproduction package. Another strategy is to develop potential robustness checks or extensions while reading the study, which then defines the set of results to be assessed via reproduction. Yet another strategy is to seek out a paper that uses a particular data set the reproducer has access to or is interested in using, reproduce all of the results that use that data set as an input, and then probe the robustness of the results to various data cleaning decisions.
The many potential uses of reproduction make the number of potential reproduction strategies very large. In choosing or designing a reproduction strategy, it is helpful to clearly identify the goal of the reproduction. In all of the examples above, the order in which the steps of the exercise are taken is at least partially determined by what the reproducer hopes to get from the exercise. The structure provided in these guidelines, together with a clear reproduction goal, can facilitate the implementation of an efficient reproduction strategy.
In this stage, you will define the scope of the exercise by declaring a paper and the specific output(s) on which you will focus for the remainder of the exercise. Before you settle on the paper you will analyze (we refer to this as the “declared paper”), you may first consider a few other papers without analyzing them closely (we refer to those as “candidate papers”).
Most likely, you will choose a declared paper based on whether or not you can locate its reproduction package. We define a reproduction package (in other contexts referred to as a “replication package”) as the collection of all materials that make it possible for a reproducer to reproduce the paper. This package may contain data, code, and/or documentation. If you are unable to independently locate the reproduction package for your paper, you can request it from the author of the paper (find guidance on how to do so in Chapter 6) or simply choose another candidate paper. To avoid duplicating the effort of others who may be interested in reproducing one of your candidate papers, we ask that you record your candidate papers in the ACRE database (currently under development). If you still want to explore the reproducibility of a paper with no reproduction package, these guidelines provide instructions on how to contact the authors with a specific request for materials to create a public reproduction package, or, if this route proves unsuccessful, on how to build your own reproduction package from scratch.
Note that in this stage you are not expected to review the reproduction materials in detail, as you will dedicate most of your time to this in later stages of the exercise. If materials are available, you will read the paper and declare the scope of the reproduction exercise. You can expect to spend one to three days on the Scoping stage, though this may vary with the length and complexity of the paper and the availability of reproduction materials.
Use Survey 1 to record your work in this stage.
At this point of the exercise, you are only validating the existence of (at least) one reproduction package, not assessing the quality of its contents. Follow the five steps below to verify the existence of a reproduction package, and stop as soon as you find one (which means you have found your declared paper).
Original reproduction package for [Title of the paper]. You will be asked to provide the URL of the repository in Survey 1.

In case you need to contact the authors, make sure to allocate sufficient time for this step (we suggest at least three weeks before the date when you plan to start the reproduction). Instructors should also plan accordingly (e.g., if the ACRE exercise is expected to take place in the middle of the semester, students should review candidate papers and, if applicable, contact the authors in the first few weeks of the semester).
Review the decision tree (Figure #) below for a more detailed overview of this process. Remember, if at any step of the process you decide to abandon the paper, make sure to record the candidate paper in the ACRE database before moving to another candidate paper. Once you have obtained the reproduction package, the candidate paper becomes your declared paper and you can move forward with the exercise! Do not invest time in doing a detailed read of any paper until you are sure that it is your declared paper.
If the ACRE database contains previous reproduction attempts of the paper, you will see a report card with the following information:
Box 1: Summary Report Card for ACRE Paper Entry
Title: Sample Title
Authors: Jane Doe & John Doe
Original Reproduction Package Available: URL/No. [If “No”] Contacted Authors?: Yes/No
[If “Yes(contacted)”] Type of Response: Categories (6).
Additional Reproduction Packages: Number (e.g., 2)
Authors Available for Further Questions for ACRE Reproductions: Yes/No/Unknown
Open for reproductions: Yes/No
If, after going through steps 1-5 above (or for any other reason), you were unable to locate the reproduction package, record your candidate paper (and, if applicable, the outcome of your correspondence with the original authors) in the ACRE database following the example above.
View Decision Tree To Select Paper
Once you have identified your declared paper, it is time to familiarize yourself with the paper and decide on the specific output(s) on which you will focus for the remainder of the exercise. The following sections in this chapter will show you how to do that.
Depending on how much time you have, we recommend that you write a short (1-2 page) summary of the paper. This will help remind you of the key elements to focus on for the reproduction, and to demonstrate your understanding of the paper (for yourself and others like your instructor/advisor).
When reading/summarizing the paper, try to answer the following questions:
By now you should have a fairly good understanding of the content of the paper. You do not, however, need to have spent any time reviewing the reproduction package in detail.
At this point, you should clearly specify which part of the paper will be the main focus of your reproduction. Focus on specific estimates, each represented by a unique claim-display item-specification combination, as illustrated in Figure 0.1. If you plan to scope more than one claim, we strongly recommend starting with just one and recording your results. You can then initiate another record in ACRE later for the second (or third, etc.) claim to reproduce, using the materials and knowledge you developed in the first exercise. You can, however, reproduce more than one claim if you are already familiar with the paper.
In the Assessment stage, the reproduction will be centered around the display item(s) that contain the specification you indicate at this point.
Identify one of the scientific claims and its corresponding preferred specification, and record its magnitude, standard error, and location in the paper (page, table number, and row and column in the table). If the authors did not explicitly choose a preferred estimate, you will be asked to select one. In addition to the preferred estimate, reproduce up to five estimates that correspond to alternative specifications of the preferred estimate.
After reading the paper, you might wonder why the authors did not conduct a specific robustness test. If you think such an analysis could have been done within the same methodology and using the same data (e.g., including/excluding a subset of the data such as “high-school dropouts” or “women”), please specify a robustness test that you would like to run before starting the Assessment stage.
You now have all the elements necessary to conduct the Scoping stage and complete Survey 1.
Before you begin working on the three main stages of the reproduction exercise (Assessment, Improvement, and Robustness), it is important to manage expectations (yours and those of your instructor/advisor). Be mindful of your time limitations when defining the scope of your reproduction activity. These will depend on the type of exercise chosen by your instructor/advisor and may vary from a homework assignment (e.g., over a couple of weeks), to a longer class project that may take a month to complete, or a semester-long project (for example as an undergraduate thesis).
Table 1 shows a tentative distribution of time across three different reproduction formats. The Scoping and Assessment stages are expected to take roughly the same share of time across all formats (lasting longer for semester-long activities, which also assume less research experience when the reproducer is an undergraduate student). Differences emerge in the distribution of time for the last two main stages: Improvement and Robustness. For shorter exercises, we recommend staying away from any improvements to the raw data (or cleaning code). This will limit how many robustness checks are possible (for example, by limiting your ability to reconstruct variables according to slightly different definitions), but it should leave plenty of time for testing different specifications at the analysis level.
Table 1: Suggested distribution of time across reproduction formats

|             | 2 weeks (~10 days)        | 1 month (~20 days)        | 1 semester (~100 days)    |
|             | analysis data | raw data  | analysis data | raw data  | analysis data | raw data  |
|-------------|---------------|-----------|---------------|-----------|---------------|-----------|
| Scoping     | 10% (1 day)               | 5% (1 day)                | 5% (5 days)               |
| Assessment  | 35%                       | 25%                       | 15%                       |
| Improvement | 25%           | 0%        | 40%           | 20%       | 30%           |           |
| Robustness  | 25%           | 5%        | 25%           | 25%       |               |           |
In this stage, you will review and describe in detail the available reproduction materials, and assess levels of computational reproducibility for the selected outputs, as well as for the overall paper. This stage is designed to record as much of the learning process behind a reproduction as possible to facilitate incremental improvements, and allow future reproducers to pick up easily where others have left off.
First, you will provide a detailed description of the reproduction package. Second, you will connect the outputs you’ve chosen to reproduce with their corresponding inputs. With these elements in place, you can score the level of reproducibility of each output, and report on paper-level dimensions of reproducibility.
In the Scoping stage, you declared a paper, identified the specific claims you will reproduce, and recorded the main estimates that support the claims. In this stage, you will identify all outputs that contain those estimates. You will also decide whether you are interested in assessing the reproducibility of an entire output (e.g., “Table 1”) or only pre-specified estimates (e.g., “rows 3 and 4 of Table 1”). Additionally, you can include other outputs of interest.
Use Survey 2 to record your work as part of this step.
Tip: We recommend that you first focus on one specific output (e.g., “Table 1”). After completing the assessment for this output, you will have a much easier time translating improvements to other outputs.
This section explains how to list all input materials found or referred to in the reproduction package. First, you will identify data sources and connect them with their raw data files (when available). Second, you will locate and provide a brief description of the analytic data files. Finally, you will locate, inspect, and describe the analytic code used in the paper.
The following terms will be used in this section:
Cleaning code: A script associated primarily with data cleaning. Most of its content is dedicated to actions like deleting variables or observations, merging data sets, removing outliers, or reshaping the structure of the data (from long to wide, or vice versa).
Analysis code: A script associated primarily with analysis. Most of its content is dedicated to actions like running regressions, running hypothesis tests, computing standard errors, and imputing missing values.
In the paper you chose, find references to all data sources used in the analysis. A data source is usually described in narrative form. For example, if in the body of the paper you see text like “…for earnings in 2018 we use the Current Population Survey…”, the data source is “Current Population Survey 2018”. If it is mentioned for the first time on page 1 of the Appendix, its location should be recorded as “A1”. Do this for all the data sources mentioned in the paper.
Data sources also vary by unit of analysis. Some sources match the unit of analysis used in the paper (as in the previous example), while others are described less precisely. For example, “our information on regional minimum wages comes from the Bureau of Labor Statistics” should be recorded as “regional minimum wages from the Bureau of Labor Statistics.”
Next, look at the reproduction package and map the data sources mentioned in the paper to the data files in the available materials. Record their folder locations relative to the main reproduction folder. In addition to looking at the existing data files, we recommend that you review the first lines of all code files (especially cleaning code), looking for lines that call the data sets. Inspecting these scripts may help you understand how different data sources are used and possibly identify any files that are missing from the reproduction package.
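This review of code files can be partially automated. The Python sketch below scans a reproduction package for lines that look like data-loading calls in Stata, R, or Python scripts; the folder path and the keyword pattern are assumptions to adapt to the package you are assessing.

```python
import re
from pathlib import Path

# Keywords that commonly load data in Stata (.do), R (.R), and Python (.py)
# scripts; extend this pattern to match the package you are assessing.
LOAD_PATTERN = re.compile(
    r"\b(use|import delimited|read\.csv|read_csv|read_dta|load|source)\b",
    re.IGNORECASE,
)

def find_data_calls(package_dir):
    """Return, for each script, the lines that appear to load a data set."""
    calls = {}
    for script in sorted(package_dir.rglob("*")):
        if script.suffix.lower() not in {".do", ".r", ".py"}:
            continue
        hits = [
            line.strip()
            for line in script.read_text(errors="ignore").splitlines()
            if LOAD_PATTERN.search(line)
        ]
        if hits:
            calls[str(script.relative_to(package_dir))] = hits
    return calls
```

Running `find_data_calls(Path("reproduction_package"))` returns a dictionary mapping each script to its candidate data-loading lines, a useful starting point for filling in the data_files column of the spreadsheet (every hit still needs manual review).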
Record this information in this standardized spreadsheet (download it or make a copy for yourself), using the following structure:
Raw data information:
|----------------------|------|-----------------------------------------------|---------------------|---------------------|
| data_source | page | data_files | known_missing | directory |
|----------------------|------|-----------------------------------------------|---------------------|---------------------|
| "Current Population | A1 | cepr_march_2018.dta | | \data\ |
| Survey 2018" | | | | |
|----------------------|------|-----------------------------------------------|---------------------|---------------------|
| "DHS 2010 - 2013" | 4 | nicaraguaDHS_2010.csv; | boliviaDHS_2011.csv | \rawdata\DHS\ |
| | | boliviaDHS_2010.csv; nicaraguaDHS_2011.csv; | | |
| | | nicaraguaDHS_2012.csv; boliviaDHS_2012.csv; | | |
| | | nicaraguaDHS_2013.csv; boliviaDHS_2013.csv | | |
|----------------------|------|-----------------------------------------------|---------------------|---------------------|
| "2017 SAT scores" | 4 | Not available | | \data\to_clean\ |
|----------------------|------|-----------------------------------------------|---------------------|---------------------|
| ... | ... | ... | ... | ... |
|----------------------|------|-----------------------------------------------|---------------------|---------------------|
Note: lists of files in the data_files and known_missing columns should have their entries separated by a semicolon (“;”) so that the spreadsheet is compatible with the ACRE Diagram Builder.
List all the analytic files you can find in the reproduction package, and identify their locations relative to the main reproduction folder. Record this information in the standardized spreadsheet.
As you progress through the exercise, add to the spreadsheet a one-line description of each file’s main content (for example, all_waves.csv might have the description “data for region-level analysis”). This may be difficult in an initial review, but will become easier as you go along.
The resulting report will have the following structure:
Analysis data information:
|----------------|-----------------------|--------------------------------|
| analysis_data | location | description |
|----------------|-----------------------|--------------------------------|
| final_data.csv | /analysis/fig1/ | data for figure1 |
|----------------|-----------------------|--------------------------------|
| all_waves.csv | /final_data/v1_april/ | data for region-level analysis |
|----------------|-----------------------|--------------------------------|
| ... | ... | ... |
|----------------|-----------------------|--------------------------------|
List all code files that you found in the reproduction package and identify their locations relative to the main reproduction folder. Review the beginning and end of each code file to identify its inputs and outputs. Inputs may include data sets or other code scripts, typically called at the beginning of the script (e.g., load, read, source, run, do). Outputs may include other data sets, figures, or plain-text files, typically written at the end of a script (e.g., save, write, export). For each code file, record all inputs together, separating items with “;”, and do the same for all outputs. Provide a one-line description of what each code file does. Record all of this information in the standardized spreadsheet, using the following structure:
Code files information:
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| file_name | location | inputs | outputs | description | primary_type |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| output_table1.do | /code/analysis/ | analysis_data01.csv | output1_part1.txt | produces first part | analysis |
| | | | | of table 1 | |
| | | | | (unformatted) | |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| data_cleaning02.R | /code/cleaning/  | admin_01raw.csv     | analysis_data02.csv | removes outliers     | cleaning     |
| | | | | and missing vals | |
| | | | | from raw admin data | |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| ... | ... | ... | ... | ... | ... |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
As you gain an understanding of each code script, you will likely find more inputs and outputs – we encourage you to update the standardized spreadsheet. Once finished with the reproduction exercise, classify each code file as analysis or cleaning. We recognize that this may involve subjective judgment, so we suggest that you conduct this classification based on each script’s main role.
Note: If a code script takes multiple inputs and/or produces multiple outputs, list them as semicolon-separated lists so that the spreadsheet is compatible with the ACRE Diagram Builder.
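For illustration, here is a minimal Python sketch of writing one such row, with a multi-file input list joined by “;”. The file names and paths are the hypothetical ones used in the example diagrams of this chapter.

```python
import csv

# One hypothetical row of the code-files spreadsheet. Multi-file inputs are
# joined with ";" so the ACRE Diagram Builder can split them back into lists.
rows = [
    {
        "file_name": "merge_1_2.do",
        "location": "/code/cleaning/",
        "inputs": ";".join(["cleaned_1.dta", "cleaned_2.dta"]),
        "outputs": "merged_1_2.dta",
        "description": "merges cleaned waves 1 and 2",
        "primary_type": "cleaning",
    },
]

# Write the rows to a CSV file with one column per spreadsheet field.
with open("code_files.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```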
Using the information collected above, you can trace your output-to-be-reproduced to its primary sources. Email the standardized spreadsheets from above (sections 2.1.1, 2.1.2 and 2.1.3) to the ACRE Diagram Builder at acre@berkeley.edu. You should receive an email within 24 hours with a reproduction diagram tree that represents the information available on the workflow behind a specific output.
If you were able to identify all the relevant components in the previous section, the ACRE Diagram Builder will produce a tree diagram that looks similar to the one below.
table1.tex
|___[code] analysis.R
|___analysis_data.dta
|___[code] final_merge.do
|___cleaned_1_2.dta
| |___[code] clean_merged_1_2.do
| |___merged_1_2.dta
| |___[code] merge_1_2.do
| |___cleaned_1.dta
| | |___[code] clean_raw_1.py
| | |___raw_1.dta
| |___cleaned_2.dta
| |___[code] clean_raw_2.py
| |___raw_2.dta
|___cleaned_3_4.dta
|___[code] clean_merged_3_4.do
|___merged_3_4.dta
|___[code] merge_3_4.do
|___cleaned_3.dta
| |___[code] clean_raw_3.py
| |___raw_3.dta
|___cleaned_4.dta
|___[code] clean_raw_4.py
|___raw_4.dta
This diagram, built with the information you provided, is already an important contribution to understanding the necessary components required to reproduce a specific output. It summarizes key information to allow for more constructive exchanges with original authors or other reproducers. For example, when contacting the authors for guidance, you can use the diagram to point out specific files you need. Formulating your request this way makes it easier for authors to respond and demonstrates that you have a good understanding of the reproduction package.
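To see how such a tree follows mechanically from the spreadsheet, consider the Python sketch below. It treats each output as a node whose producing script and inputs are its children, and walks backwards from the final display item. The file names are the hypothetical ones from the diagram above (only the first branch is included for brevity), and the actual ACRE Diagram Builder may work differently.

```python
# Map each output file to the (script, inputs) pair that produces it, as
# recorded in the code-files spreadsheet. Names are hypothetical examples.
code_files = {
    "table1.tex": ("analysis.R", ["analysis_data.dta"]),
    "analysis_data.dta": ("final_merge.do", ["cleaned_1_2.dta"]),
    "cleaned_1_2.dta": ("clean_merged_1_2.do", ["merged_1_2.dta"]),
    "merged_1_2.dta": ("merge_1_2.do", ["cleaned_1.dta", "cleaned_2.dta"]),
    "cleaned_1.dta": ("clean_raw_1.py", ["raw_1.dta"]),
    "cleaned_2.dta": ("clean_raw_2.py", ["raw_2.dta"]),
}

def tree_lines(output, depth=0):
    """Walk backwards from an output to its raw-data sources."""
    lines = ["    " * depth + output]
    if output in code_files:
        script, inputs = code_files[output]
        lines.append("    " * (depth + 1) + "[code] " + script)
        for inp in inputs:
            lines.extend(tree_lines(inp, depth + 1))
    return lines

print("\n".join(tree_lines("table1.tex")))
```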
In many cases, some components of the workflow will be missing from, or not easily identifiable in, the reproduction package. In these cases the Diagram Builder will return a partial reproduction tree diagram. For example, if the files merge_1_2.do, merge_3_4.do, and final_merge.do are missing from the previous diagram, the ACRE Diagram Builder will produce the following diagram:
cleaned_3.dta
|___[code] clean_raw_3.py
|___raw_3.dta
table1.tex
|___[code] analysis.R
|___analysis_data.dta
cleaned_3_4.dta
|___[code] clean_merged_3_4.do
|___merged_3_4.dta
cleaned_1.dta
|___[code] clean_raw_1.py
|___raw_1.dta
cleaned_2.dta
|___[code] clean_raw_2.py
|___raw_2.dta
cleaned_4.dta
|___[code] clean_raw_4.py
|___raw_4.dta
cleaned_1_2.dta
|___[code] clean_merged_1_2.do
|___merged_1_2.dta
Unused data sources: None.
In this case, you can still manually combine the partial information with your knowledge of the paper and your own judgment to produce a “candidate” tree diagram (which means that different reproducers might recreate different diagrams). The result may look like the following:
table1.tex
|___[code] analysis.R
|___analysis_data.dta
|___MISSING CODE FILE(S) #3
|___cleaned_3_4.dta
| |___[code] clean_merged_3_4.do
| |___merged_3_4.dta
| |___MISSING CODE FILE(S) #2
| |___cleaned_3.dta
| | |___[code] clean_raw_3.py
| | |___raw_3.dta
| |___cleaned_4.dta
| |___[code] clean_raw_4.py
| |___raw_4.dta
|___cleaned_1_2.dta
|___[code] clean_merged_1_2.do
|___merged_1_2.dta
|___MISSING CODE FILE(S) #1
|___cleaned_1.dta
| |___[code] clean_raw_1.py
| |___raw_1.dta
|
|___cleaned_2.dta
|___[code] clean_raw_2.py
|___raw_2.dta
To leave a record of the reconstructed diagrams, you will have to amend the input spreadsheets using placeholders for the missing components. In the example above, you should add the following entries to the code description spreadsheet:
Adding rows to code spreadsheet:
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| file_name | location | inputs | outputs | description | primary_type |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| ... | ... | ... | ... | ... | ... |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| missing_file1 | unknown | cleaned_1.dta; | merged_1_2.dta | missing code | unknown |
| | | cleaned_2.dta | | | |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| missing_file2 | unknown | cleaned_3.dta; | merged_3_4.dta | missing code | unknown |
| | | cleaned_4.dta | | | |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
| missing_file3 | unknown | merged_3_4.dta; | analysis_data.dta | missing code | unknown |
| | | merged_1_2.dta | | | |
|-------------------|------------------|---------------------|---------------------|----------------------|--------------|
As in the cases with complete workflows, these diagrams (fragmented or reconstructed trees) provide important information for assessing and improving the reproducibility of specific outputs. Reproducers can compare reconstructed trees and/or contact original authors with highly specific inquiries.
For more examples of diagrams connecting final outputs to initial raw data, see here.
It is possible that not all data included in a reproduction package are actually used by its code scripts. This would be the case if, for example, the raw data and analysis data are included, but not the script that generates the analysis data. As a concrete example, consider what the original diagram above would look like if the only code included in the reproduction package were analysis.R:
table1.tex
|___[code] analysis.R
|___analysis_data.dta
Unused data sources:
raw_1.dta
raw_2.dta
raw_3.dta
raw_4.dta
Unused analysis data:
cleaned_1.dta
cleaned_2.dta
cleaned_3.dta
cleaned_4.dta
merged_1_2.dta
merged_3_4.dta
cleaned_1_2.dta
cleaned_3_4.dta
In this case, many data files listed in the raw data and analytic data spreadsheets are not used by any code script in the reproduction package.
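Flagging unused files is then a simple set difference between the data files listed in the spreadsheets and the inputs referenced by any script. A Python sketch, using the hypothetical file names from the example above:

```python
# Data files listed in the raw and analysis data spreadsheets
# (hypothetical names from the example above).
data_files = {
    "raw_1.dta", "raw_2.dta", "raw_3.dta", "raw_4.dta",
    "cleaned_1.dta", "cleaned_2.dta", "cleaned_3.dta", "cleaned_4.dta",
    "merged_1_2.dta", "merged_3_4.dta",
    "cleaned_1_2.dta", "cleaned_3_4.dta",
    "analysis_data.dta",
}

# Inputs referenced by the scripts actually present in the reproduction
# package -- here, only analysis.R.
script_inputs = {"analysis.R": ["analysis_data.dta"]}

# Anything listed in the spreadsheets but never read by a script is unused.
used = {f for inputs in script_inputs.values() for f in inputs}
unused = sorted(data_files - used)
print(unused)
```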
Once you have identified all possible inputs and have a clear understanding of the connection between the outputs and inputs, you can start to assess the output-specific level of reproducibility.
Take note of the following concepts in this section:
Computationally Reproducible from Analytic data (CRA): The output can be reproduced with minimal effort starting from the analytic datasets.
Computationally Reproducible from Raw data (CRR): The output can be reproduced with minimal effort from the raw datasets.
Minimal effort: One hour or less is required to run the code, not including computing time.
Each level of computational reproducibility is defined by the availability of data and materials, and whether or not the available materials faithfully reproduce the output of interest. The description of each level also includes possible improvements that can help advance the reproducibility of the output to a higher level. You will learn about the possible improvements in more detail in the Improvement stage.
Note that the assessment is made at the output level: a paper can be highly reproducible for its main results but suffer from low reproducibility for other outputs. The assessment uses a 10-point scale, where 1 means that, under current circumstances, reproducers cannot access any reproduction package, and 10 means that all materials are available and the target output can be reproduced starting from the raw data.
You will have detected papers that are reproducible at Level 1 as part of the Scoping stage (unsuccessful candidate papers). Make sure to record them in Survey 1.
Level 2 (L2): Code scripts are available (partial or complete), but no data are available. Possible improvements include adding raw data (+RD) and analysis data (+AD).
Level 3 (L3): Analytic data and code are partially available, but raw data and cleaning code are not. Possible improvements include: completing analysis data and/or code, adding raw data (+RD), and adding analysis code (+AC).
Level 4 (L4): All analytic data sets and analysis code are available, but the code does not run or produces results different from those in the paper (not CRA). Possible improvements include: debugging the analysis code (DAC) or obtaining raw data (+RD).
Level 5 (L5): Analytic data sets and analysis code are available. They produce the same results as presented in the paper (CRA). The reproducibility package may be improved by obtaining the original raw data sets.
This is the highest level that most published research papers can attain currently. Computational reproducibility from raw data is required for papers that are reproducible at Level 6 and above.
Level 6 (L6): Cleaning code is partially available, but raw data is not. Possible improvements include: completing cleaning code (+CC) and/or raw data (+RD).
Level 7 (L7): Cleaning code is available and complete, but raw data is not. Possible improvements include: adding raw data (+RD).
Level 8 (L8): Cleaning code is available and complete, and raw data is partially available. Possible improvements include: adding raw data (+RD).
Level 9 (L9): All the materials (raw data, analytic data, cleaning code, and analysis code) are available. The analysis code produces the same output as presented in the paper (CRA). However, the cleaning code does not run or produces results different from those presented in the paper (not CRR). Possible improvements include: debugging the cleaning code (DCC).
Level 10 (L10): All the materials are available and produce the same results as presented in the paper with minimal effort, starting from the analytic data (yes CRA) or the raw data (yes CRR). Note that Level 10 is aspirational and may be very difficult to attain for most research published today.
The following figure summarizes the different levels of computational reproducibility (for any given output). For each level, there will be improvements that have been made (✔) or can be made to move up one level of reproducibility (-).
Levels of Computational Reproducibility
(P denotes "partial", C denotes "complete")

| Level | Analysis code (P C) | Analysis data (P C) | CRA | Cleaning code (P C) | Raw data (P C) | CRR |
|---|---|---|---|---|---|---|
| L1: No materials | - - | - - | - | - - | - - | - |
| L2: Only code | ✔ ✔ | - - | - | - - | - - | - |
| L3: Partial analysis data & code | ✔ ✔ | ✔ - | - | - - | - - | - |
| L4: All analysis data & code | ✔ ✔ | ✔ ✔ | - | - - | - - | - |
| L5: Reproducible from analysis data | ✔ ✔ | ✔ ✔ | ✔ | - - | - - | - |
| L6: Some cleaning code | ✔ ✔ | ✔ ✔ | ✔ | ✔ - | - - | - |
| L7: All cleaning code | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | - - | - |
| L8: Some raw data | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | ✔ - | - |
| L9: All raw data | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | ✔ ✔ | - |
| L10: Reproducible from raw data | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | ✔ ✔ | ✔ |
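The level definitions above amount to a simple decision rule. The sketch below encodes that ordering, assuming availability has already been summarized as "none"/"partial"/"complete" flags plus CRA/CRR booleans; the function name and encoding are ours, not ACRE identifiers, and the sketch glosses over some edge cases (e.g., partial analysis code at Level 2).

```python
def reproducibility_level(analysis_code, analysis_data, cra,
                          cleaning_code, raw_data, crr):
    """Map availability flags to a Level 1-10 score for one output.

    analysis_code/analysis_data/cleaning_code/raw_data take
    'none', 'partial', or 'complete'; cra and crr are booleans.
    """
    if analysis_code == "none":
        return 1            # no materials
    if analysis_data == "none":
        return 2            # only code
    if analysis_data == "partial":
        return 3            # partial analysis data & code
    if not cra:
        return 4            # all analysis materials, but not CRA
    if cleaning_code == "none":
        return 5            # reproducible from analysis data
    if cleaning_code == "partial":
        return 6            # some cleaning code
    if raw_data == "none":
        return 7            # all cleaning code
    if raw_data == "partial":
        return 8            # some raw data
    if not crr:
        return 9            # all raw data, but not CRR
    return 10               # reproducible from raw data
```

Walking the conditions top to bottom mirrors reading the table row by row: each level adds one requirement to the previous one.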
You may disagree with some of the levels outlined above, particularly wherever subjective judgment may be required. If so, you are welcome to interpret the levels as unordered categories (independent from their sequence) and suggest improvements using the “Edit” button above (top left corner if you are reading this document in your browser).
A large portion of published research in economics uses confidential or proprietary data, most often administrative data, such as government records on taxes or service provision. Since administrative and proprietary data are rarely publicly accessible, some of the reproducibility levels presented above only apply once modified. The underlying theme of these modifications is that when data cannot be provided, you can assign a reproducibility score based on the level of detail in the instructions for accessing the data. Similarly, when reproducibility cannot be verified based on publicly available materials, the reproduction materials should demonstrate that a competent and unbiased third party (not involved in the original research team) has been able to reproduce the results.
Levels 1 through 3 can be applied as described above.
Adjusted Level 4 (L4*): All analysis code is provided, and complete and detailed instructions on how to access the analysis data are available.
Adjusted Level 5 (L5*): All requirements for Level 4* are met, and the authors provide a certification that the output can be reproduced from the analysis data (CRA) by a third party. Examples include a signed letter by a disinterested reproducer or an official reproducibility certificate from a certification agency for data and code (e.g., see cascad).
Levels 6 and 7 can be applied as described above.
Adjusted Level 8 (L8*): All requirements for Level 7 are met, but instructions for accessing the raw data are incomplete. Use the instructions described in Level 3 above to assess the instructions’ completeness.
Adjusted Level 9 (L9*): All requirements for Level 8* are met, and instructions for accessing the raw data are complete.
Adjusted Level 10 (L10*): All requirements for Level 9* are met, and a certification that the output can be reproduced from the raw data is provided.
Levels of Computational Reproducibility with Proprietary/Confidential Data
(P denotes "partial", C denotes "complete")

| Level | Analysis code (P C) | Instr. analysis data (P C) | CRA | Cleaning code (P C) | Instr. raw data (P C) | CRR |
|---|---|---|---|---|---|---|
| L1: No materials | - - | - - | - | - - | - - | - |
| L2: Only code | ✔ ✔ | - - | - | - - | - - | - |
| L3: Partial analysis data & code | ✔ ✔ | ✔ - | - | - - | - - | - |
| L4*: All analysis data & code | ✔ ✔ | ✔ ✔ | - | - - | - - | - |
| L5*: Proof of third-party CRA | ✔ ✔ | ✔ ✔ | ✔ | - - | - - | - |
| L6: Some cleaning code | ✔ ✔ | ✔ ✔ | ✔ | ✔ - | - - | - |
| L7: All cleaning code | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | - - | - |
| L8*: Some instr. for raw data | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | ✔ - | - |
| L9*: All instr. for raw data | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | ✔ ✔ | - |
| L10*: Proof of third-party CRR | ✔ ✔ | ✔ ✔ | ✔ | ✔ ✔ | ✔ ✔ | ✔ |

In addition to the output-specific assessment and improvement of computational reproducibility, several practices can facilitate reproducibility at the level of the overall paper. You can read about such practices in greater detail in the next chapter, dedicated to Stage 3: Improvements. In this Assessment section, you should only verify whether the original reproduction package made use of any of the following:
Congratulations! You have now completed the Assessment stage of this exercise. You have provided a concrete building block of knowledge to improve understanding of the state of reproducibility in Economics.
Please continue to the next section where you can help improve it!
After assessing the paper’s reproducibility package, you can start proposing ways to improve its reproducibility. Making improvements provides an opportunity to gain a deeper understanding of the paper’s methods, findings, and overall contribution. Each contribution can also be assessed and used by the wider ACRE community, including other students and researchers using the ACRE platform.
As with the Assessment section, we recommend that you first focus on one specific display item (e.g., “Table 1”). After making improvements to this first item, you will have a much easier time translating those improvements to other ones.
Use Survey 2 to record your work as part of this step.
Reproduction packages often do not include all original raw datasets. To obtain any missing raw data, or information about them, follow these steps:
Record each data source in the data_source field of this standardized spreadsheet. However, some data sources (as collected by the original investigators) might be missing one or more data files. You can sometimes find the specific names of those files by looking at the beginning of the cleaning code scripts. If you find the name of a missing file, record it in the known_missing field of the same spreadsheet; if not, record “Some/All” in the known_missing field for the specific data source.

In addition to trying to obtain the raw data, you can also contribute by obtaining missing analytic data.
Analytic data can be missing for two reasons: (i) raw data exists, but the procedures to transform it into analytic data are not fully reproducible, or (ii) some or all raw data is missing, and some or all analytic data is not included in the original reproduction package. To obtain any missing analytic data, follow these steps:
analysis_data_03.csv).

Analysis code can be added when analytic data files are available, but some or all methodological steps are missing from the code. In this case, follow these steps:
Identify the specific line or paragraph in the paper that describes the analytic step that is missing from the code (e.g., “We impute missing values to…” or “We estimate this regression using a bandwidth of…”).
Identify the code file and the approximate line in the script where the analysis can be carried out. If you cannot find the relevant code file, identify its location relative to the main folder using the steps in the reproduction diagram.
Use the ACRE database to verify if previous attempts have been made to contact the authors about this issue.
Contact the authors and request the specific code files.
If step #4 does not work, we encourage you to attempt to recreate the analysis using your own interpretation of the paper, making your assumptions explicit when filling in any gaps.
Data cleaning (processing) code might be added when steps are missing in the creation or re-coding of variables, merging, subsetting of the data sets, or other steps related to data cleaning and processing. You should follow the same steps you used when adding missing analysis code (1-5).
Whenever code is available in the reproduction package, you should be able to debug those scripts. There are four types of debugging that can improve the reproduction package:
Follow the same steps that you did to debug the analysis code, but report them separately.
Track all the different types of improvements you make and record them in this standardized spreadsheet, which has the following structure:
Level-specific quality improvements: add data/code, debug code.
| output_name | imprv | description_of_added_files | lvl |
|-------------|-------|-----------------------------------|-----|
| table 1 | +AD | ADD EXAMPLES | 5 |
| table 1 | +RD | ADD EXAMPLES | 5 |
| table 1 | DCC | ADD EXAMPLES | 5 |
| figure 1 | +CC | | 6 |
| figure 1 | DAC | | 6 |
| inline 1 | DAC | | 8 |
| ... | ... | ... | ... |
There are at least six additional improvements you can make to improve a paper’s overall reproducibility. These additional improvements can be applied across all reproducibility levels (including level 10).
You will be asked to provide this information in the Assessment and Improvement Survey.
Once you have assessed and improved the computational reproducibility of a paper, you can assess the quality of different analytical choices by including new robustness checks in addition to those included in the original paper. We use the term robustness checks to describe any possible change in a computational choice, both in data analysis and data cleaning, and its subsequent effect on the main estimates of interest. The universe of robustness checks can be very large or potentially infinite. The focus should be on the set of reasonable specifications (Simonsohn et al. 2018), defined as (1) sensible tests of the research question, (2) expected to be statistically valid, and (3) not redundant with other specifications in the set.
The addition of new robustness checks will depend on the current level of reproducibility. For claims supported by display items reproducible at Level 1, it is not possible to perform any robustness checks beyond those already in the paper, because no code or data are available to modify. It may be possible to perform additional robustness checks for claims supported by display items reproducible at Levels 2-4, but not using the specific estimates declared in Stage 1: Scoping, because those display items are not computationally reproducible from analysis data (CRA). For a claim based on a display item reproducible at Level 5, additional robustness checks can be included to validate its core conclusion. Finally, a claim associated with display items reproducible at Level 6 and above allows for robustness checks that involve variable definitions and data manipulations; when checking robustness to a new variable definition, reproducers can also test how the main estimate changes when an alternative variable definition is combined with an alternative core analytical choice.
Going back to our diagram that represents the multiple parts of a paper (0.1), the robustness section begins at the claim level. For a given claim, there will be several specifications presented in the paper, one of which is identified by the authors (or yourself, in the absence of one designated by the authors) as the main or preferred specification. Identify which display item contains this specification and refer to the reproduction tree to identify the code files where you can potentially modify a computational choice. Using the example tree discussed in the Assessment stage, we can remove the data files for simplicity and obtain the following:
table1.tex (contains preferred specification of a given claim)
|___[code] analysis.R
|___[code] final_merge.do
|___[code] clean_merged_1_2.do
| |___[code] merge_1_2.do
| |___[code] clean_raw_1.py
| |___[code] clean_raw_2.py
|___[code] clean_merged_3_4.do
    |___[code] merge_3_4.do
    |___[code] clean_raw_3.py
    |___[code] clean_raw_4.py
This simplified tree gives you a list of potential files where you could test different reasonable specifications. Here we suggest two types of contributions to robustness checks: i) mapping the universe of robustness checks and ii) testing reasonable specifications. Both contributions should be recorded in the ACRE platform referring to files in a specific reproduction package.
Analytical choices in data cleaning code
- Variable definition
- Data sub-setting
- Data reshaping (merge, append, long/gather, wide/spread)
- Others (specify as “processing - other”)
Analytical choices in analysis code
- Regression function (link function)
- Key parameters (tuning, tolerance parameters, etc.)
- Controls
- Adjustment of standard errors
- Choice of weights
- Treatment of missing values
- Imputations
- Other (specify as “methods - other”)
Once finished, transcribe all of the information on analytical choices into a dataset (the ACRE platform will allow for easier recording once deployed). For the source field, type “original” whenever the analytical choice is identified for the first time, and file_name-line_number every subsequent time the same analytical choice is applied. For example, if an analytical choice is identified for the first time at line 103 of code_01.do and again at line 122, the respective values of the source field should be original and code_01.do-L103.
For each analytical choice recorded, add the specific choice that the paper used, and describe what other alternatives could have been used. The resulting database should have the following structure:
| entry_id | file_name | line_number | choice_type | choice_value | choice_range | Source |
|---|---|---|---|---|---|---|
| 1 | code_01.do | 73 | data sub-setting | males | males, females | original |
| 2 | code_01.do | 122 | variable definition | income = wages + capital gains | wages, capital gains, gifts | “code_01.do-L103” |
| 3 | code_05.R | 143 | controls | age, income, education | age, income, education, region | original |
| … | … | … | … | … | … | … |
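The source-field rule above (the first occurrence of a choice is “original”; every repeat points back to the file and line of the first occurrence) can be sketched as follows. The field names follow the table; everything else is illustrative.

```python
def assign_sources(records):
    """Fill in the source field for a list of analytical-choice records.

    Each record is a dict with file_name, line_number, choice_type,
    and choice_value. Records are assumed to be in the order in which
    the choices were identified.
    """
    first_seen = {}   # (choice_type, choice_value) -> "file-L<line>"
    out = []
    for rec in records:
        key = (rec["choice_type"], rec["choice_value"])
        if key not in first_seen:
            first_seen[key] = f'{rec["file_name"]}-L{rec["line_number"]}'
            source = "original"
        else:
            source = first_seen[key]   # point back to the first occurrence
        out.append({**rec, "source": source})
    return out
```

Keying on (choice_type, choice_value) is one reasonable way to decide that two entries represent “the same” analytical choice; in practice you may need a manual judgment call.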
The advantage of this type of contribution is that you are not required to have an in-depth knowledge of the paper and its methodology to contribute. This allows you to potentially map several code files, achieving a broader understanding of the paper. The disadvantage is that you are not expected to test alternative specifications.
When performing a specific robustness test, follow these steps:
Search the mapping database described in the previous section and record the identifier(s) corresponding to the analytical choice you want to test (entry_id). If there is no entry for the specific lines, please create one.
Propose a specific variation to this analytical choice.
Discuss whether you think this variation is sensible, specifically in the context of the claim tested (e.g., does it make sense to exclude low-income Hispanics from the sample?).
Discuss how this variation could affect the validity of the results (e.g. likely effects on omitted variable bias, measurement error, change in the Local Average Treatment Effects for the underlying population).
Confirm that the test is not redundant with other tests in the paper or robustness exercise.
Report the results from the robustness check (new estimate, standard error, and units).
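Step 6 can be illustrated with a toy example: re-estimate a one-regressor slope after applying a sample-restriction robustness check, and report the original and alternative estimates side by side. The data and the subsetting rule here are made up for illustration and do not come from any real reproduction.

```python
def slope(xs, ys):
    """OLS slope for a one-regressor model: cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

def robustness_check(data, keep):
    """Compare the estimate on the full sample vs. a restricted sample."""
    original = slope([d["x"] for d in data], [d["y"] for d in data])
    sub = [d for d in data if keep(d)]
    alternative = slope([d["x"] for d in sub], [d["y"] for d in sub])
    return {"original": original, "alternative": alternative}
```

In a real reproduction you would run the package's own estimation code twice, once under the original choice and once under the variation, and report both estimates with their standard errors and units.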
The advantage of this approach is that it allows for an in-depth inspection of a specific section of the paper. The main limitation is that justifying sensibility and validity (and non-redundancy, to some extent) requires a much deeper understanding of the topic and the methods of the paper, making it less feasible for undergraduate students or graduate students with only a general interest in the paper.
table 1
└───[code] formatting_table1.R
├───output1_part1.txt
| └───[code] output_table1.do
| └───[data] analysis_data01.csv
| └───[code] data_cleaning01.R
| └───[data] survey_01raw.csv
└───output1_part2.txt
└───[code] output_table2.do
└───[data] analysis_data02.csv
└───[code] data_cleaning02.R
└───[data] admin_01raw.csv
table 1
└───[code] formatting_table1.R
├───output1_part1.txt
| └───[code] output_table1.do
| └───[data] analysis_data01.csv
| └───[code] MISSING FILE(S)
| └───[data] survey_01raw.csv
└───output1_part2.txt
└───[code] output_table2.do
└───[data] analysis_data02.csv
            └───[code] MISSING FILE(S)
└───[data] admin_01raw.csv
The ACRE project welcomes feedback from participants and the wider social science community. If you wish to provide feedback on specific chapters or sections, click the “edit” icon at the top of this page. This will prompt you to sign in to or create a GitHub account, after which you will be able to “commit” changes directly to the text. Committed changes will be reviewed by the ACRE project team before being “pushed” to these guidelines. For more general feedback, please contact ACRE@berkeley.edu.
Major contributions to these guidelines will be acknowledged below. The ACRE project employs the Contributor Roles Taxonomy (CRediT). Major contributions are defined as any pushed revisions to the guideline language or source code beyond corrections of spelling and grammar.
Aleksandar Bogdanoski – Funding acquisition, Project administration, Writing (original draft), Writing (reviewing and editing)
Carson Christiano – Funding acquisition, Project administration, Writing (reviewing and editing)
Joel Ferguson – Writing (original draft), Writing (reviewing and editing)
Fernando Hoces de la Guardia – Conceptualization, Funding acquisition, Writing (original draft), Writing (reviewing and editing)
Katherine Hoeberling – Funding acquisition, Project administration, Writing (original draft), Writing (reviewing and editing)
Edward Miguel – Conceptualization, Funding acquisition, Supervision
Emma Ng – Visualization, Writing (original draft), Writing (reviewing and editing)
Lars Vilhuber – Conceptualization, Funding acquisition, Supervision
Raw data are unmodified files obtained by the authors from the sources cited in the paper. Raw data from which personally identifiable information (PII) has been removed is still considered raw. All other modifications to raw data make it processed. A data set may be classified as raw if it fits any of the following criteria:
Causal claim: This paper estimates the effect of X on Y for population P, using method F. Example: “This paper investigates the impact of bicycle provision on secondary school enrollment among young women in Bihar/India, using a Difference in Difference approach.”
Descriptive/predictive claim: This paper estimates the value of Y (estimated or predicted) for population P under dimensions X using method M. Example: “Drawing on a unique Swiss data set (population P) and exploiting systematic anomalies in countries’ portfolio investment positions (method M), I find that around 8% of the global financial wealth of households is held in tax havens (value of Y)”
Reproduction package: Collection of all the materials associated with the reproduction of a paper. A reproduction package may contain data, code, and documentation. When the materials are provided with the original publication, they are labeled the ‘original reproduction package’; when they are provided by a previous reproducer, they are referred to as ‘reproducer X’s reproduction package’. At this point you are only assessing the existence of one (or more) reproduction packages; you will not be assessing the quality of their contents at this stage.
Chang, Andrew, and Phillip Li. 2015. “Is Economics Research Replicable? Sixty Published Papers from Thirteen Journals Say ‘Usually Not’.” Available at SSRN 2669564.
Christensen, Garret, Jeremy Freese, and Edward Miguel. 2019. Transparent and Reproducible Social Science Research: How to Do Open Science. University of California Press.
Galiani, S, P Gertler, and M Romero. 2018. “How to Make Replication the Norm.” Nature 554 (7693): 417–19.
King, Gary. 1995. “Replication, Replication.” PS: Political Science and Politics 28: 444–52.
Kingi, Hautahi, Lars Vilhuber, Sylverie Herbert, and Flavio Stanchi. 2018. “The Reproducibility of Economics Research: A Case Study.” Presented at the BITSS Annual Meeting 2018; available at the Open Science ….
a relative location takes the form of /folder_in_rep_materials/sub_folder/file.txt, in contrast to an absolute location that takes the form of username/documents/projects/repros/folder_in_rep_materials/sub_folder/file.txt↩